

The interactive word picker is at the bottom of this notebook!


homework 3, version 0


Submission by: Jazzy Doe (jazz@mit.edu)


Homework 3: Structure and Language

18.S191, fall 2020

This notebook contains built-in, live answer checks! In some exercises you will see a coloured box, which runs a test case on your code, and provides feedback based on the result. Simply edit the code, run it, and the check runs again.

For MIT students: there will also be some additional (secret) test cases that will be run as part of the grading process, and we will look at your notebook and write comments.

Feel free to ask questions!

# edit the code below to set your name and kerberos ID (i.e. email without @mit.edu)

student = (name = "Jazzy Doe", kerberos_id = "jazz")

# you might need to wait until all other cells in this notebook have completed running.
# scroll around the page to see what's up

Let's create a package environment:

begin
    using Pkg
    Pkg.activate(mktempdir())
end
  Activating new project at `/tmp/jl_7TwWoc`
begin
    Pkg.add([
        "Compose",
        "Colors",
        "PlutoUI",
    ])

    using Colors
    using PlutoUI
    using Compose
    using LinearAlgebra
end
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed Compose ─ v0.9.5
    Updating `/tmp/jl_7TwWoc/Project.toml`
  [5ae59095] + Colors v0.12.11
  [a81c6b42] + Compose v0.9.5
  [7f904dfe] + PlutoUI v0.7.61
    Updating `/tmp/jl_7TwWoc/Manifest.toml`
  [6e696c72] + AbstractPlutoDingetjes v1.3.2
  [3da002f7] + ColorTypes v0.11.5
  [5ae59095] + Colors v0.12.11
  [34da2185] + Compat v4.16.0
  [a81c6b42] + Compose v0.9.5
  [864edb3b] + DataStructures v0.18.20
  [53c48c17] + FixedPointNumbers v0.8.5
  [47d2ed2b] + Hyperscript v0.0.5
  [ac1192a8] + HypertextLiteral v0.9.5
  [b5f81e59] + IOCapture v0.2.5
  [c8e1da08] + IterTools v1.4.0
  [682c06a0] + JSON v0.21.4
  [6c6e2e6c] + MIMEs v1.0.0
  [442fdcdd] + Measures v0.3.2
  [bac558e1] + OrderedCollections v1.8.0
  [69de0a69] + Parsers v2.8.1
  [7f904dfe] + PlutoUI v0.7.61
  [aea7be01] + PrecompileTools v1.2.1
  [21216c6a] + Preferences v1.4.3
  [189a3867] + Reexport v1.2.2
  [ae029012] + Requires v1.3.1
  [410a4b4d] + Tricks v0.1.10
  [5c2747f8] + URIs v1.5.1
  [0dad84c5] + ArgTools
  [56f22d72] + Artifacts
  [2a0f44e3] + Base64
  [ade2ca70] + Dates
  [f43a241f] + Downloads
  [7b1f6079] + FileWatching
  [b77e0a4c] + InteractiveUtils
  [b27032c2] + LibCURL
  [76f85450] + LibGit2
  [8f399da3] + Libdl
  [37e2e46d] + LinearAlgebra
  [56ddb016] + Logging
  [d6f4376e] + Markdown
  [a63ad114] + Mmap
  [ca575930] + NetworkOptions
  [44cfe95a] + Pkg
  [de0858da] + Printf
  [3fa0cd96] + REPL
  [9a3f8284] + Random
  [ea8e919c] + SHA
  [9e88b42a] + Serialization
  [6462fe0b] + Sockets
  [2f01184e] + SparseArrays
  [10745b16] + Statistics
  [fa267f1f] + TOML
  [a4e569a6] + Tar
  [8dfed614] + Test
  [cf7118a7] + UUIDs
  [4ec0a83e] + Unicode
  [e66e0078] + CompilerSupportLibraries_jll
  [deac9b47] + LibCURL_jll
  [29816b5a] + LibSSH2_jll
  [c8ffd9c3] + MbedTLS_jll
  [14a3606d] + MozillaCACerts_jll
  [4536629a] + OpenBLAS_jll
  [83775a58] + Zlib_jll
  [8e850b90] + libblastrampoline_jll
  [8e850ede] + nghttp2_jll
  [3f19e933] + p7zip_jll
Precompiling project...
Compose
  1 dependency successfully precompiled in 3 seconds (28 already precompiled)






Exercise 1: Language detection

In this exercise, we are going to create some super simple Artificial Intelligence. Natural language can be quite messy, but hidden in this mess is structure, which we are going to look for today.

Let's start with some obvious structure in English text: the set of characters that we write the language in. If we generate random text by sampling random Unicode characters, it does not look like English:

"\Ua0c7b𥝩\U6457d\U7c419\Ud813f\Udaba1\Ue4e6e𘑽\Ud095a\U10f259\U9a3d5\U51385\Ue785b\Uedade\U7dce1𤻽\U105aee\Ua70e2\U3877c\U94e42𧹼𭿭\U36658ﲽ\U8a0a7\Ub35ac\Uc99b8\U104281\U10d99\Ua1de2\ufd49\Ue1240\Uae70d\Uc90dc\U1ed5e\U699cc\Ud7426\Ub2f4d\U93dad\Uc082e"
String(rand(Char, 40))

Instead, let's define an alphabet, and only use those letters to sample from. To keep things simple, we ignore punctuation, capitalization, etc., and only use these 27 characters:

alphabet = ['a':'z'..., ' '] # includes the space

Let's sample random characters from our alphabet:

wimqftzjinlktsjtzrrdqfkvvozfivmhiehbtgay
String(rand(alphabet, 40)) |> Text

That already looks a lot better than our first attempt! But still, this does not look like English text - we can do better.


English words are not well modelled by this random-Latin-characters model. Our first observation is that some letters are more common than others. To put this observation into practice, we would like to have the frequency table of the Latin alphabet. We could search for it online, but it is actually very simple to calculate ourselves! The only thing we need is a representative sample of English text.

The following samples are from Wikipedia, but feel free to type in your own sample! You can also enter a sample of a different language, if that language can be expressed in the Latin alphabet.

Remember that the button on the left of a cell will show or hide the code.

We also include a sample of Spanish, we'll use it later!


Exercise 1.1 - Cleaning data

Looking at the sample, we see that it is quite messy - it contains punctuation, accented letters and numbers. For our analysis, we are only interested in our 27-character alphabet (i.e. 'a' through 'z' plus ' '). We are going to clean the data using the Julia function filter.

filter(isodd, [6, 7, 8, 9, -5])

filter takes two arguments: a function and a collection. The function is applied to each element of the collection, and it returns either true or false. If the result is true, then that element ends up in the final collection.

Did you notice something cool? Functions are also just objects in Julia, and you can use them as arguments to other functions! (Fons thinks this is super cool.)
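For example, filter also works on strings. Here we pass an anonymous function as the first argument - the predicate (keeping only vowels) is made up, just for illustration:

filter(c -> c in "aeiou", "hello world") # gives "eoo"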


We have written a function isinalphabet, which takes a character, and returns a boolean:

isinalphabet (generic function with 1 method)
isinalphabet('a'), isinalphabet('+')
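(The code defining isinalphabet is hidden in this notebook. A minimal sketch of what it could look like, assuming a simple membership test - the hidden version may differ:)

isinalphabet(character) = character in alphabet # sketch, not the notebook's hidden definition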

👉 Use filter to extract just the characters from our alphabet out of messy_sentence_1.

"#wow 2020 ¥500 (blingbling!)"
messy_sentence_1 = "#wow 2020 ¥500 (blingbling!)"
missing
cleaned_sentence_1 = missing

Here we go!

Replace missing with your answer.


We are not interested in the case of letters ('A' vs 'a'), so we want to map these to lowercase before we apply our filter. If we don't, all uppercase letters get deleted.


👉 Use the function lowercase to convert messy_sentence_2 into a lowercase string, and then use filter to extract only the characters from our alphabet.

"Awesome! 😍"
messy_sentence_2 = "Awesome! 😍"
missing
cleaned_sentence_2 = missing

Here we go!

Replace missing with your answer.


Finally, we need to deal with accents - simply deleting accented characters from the source text might deform it too much. We could add accented letters to our alphabet, but a simpler solution is to replace accented letters with the unaccented base character. We have written a function unaccent that does just that.

"Égalité!"
french_word = "Égalité!"
"Egalite!"
unaccent(french_word)
unaccent

Turn "áéíóúüñ asdf" into "aeiouun asdf".

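(The implementation of unaccent is hidden. A minimal sketch using Julia's Unicode standard library, assuming that decomposing characters and stripping the combining marks is acceptable - the hidden version may differ:)

using Unicode

# sketch: decompose accented characters, then drop the combining marks
unaccent_sketch(str) = Unicode.normalize(str, stripmark=true)

unaccent_sketch("áéíóúüñ asdf") # "aeiouun asdf"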

👉 Let's put everything together. Write a function clean that takes a string, and returns a cleaned version, where:

  • accented letters replaced by their base characters

  • uppercase converted to lowercase

  • filtered to only contain characters from alphabet

clean (generic function with 1 method)
function clean(text)
    # we turn everything to lowercase to keep the number of letters small
    filter(isinalphabet, unaccent(lowercase(text)))
end
"creme brulee est mon plat prefere"
clean("Crème brûlée est mon plat préféré.")

Got it!

Keep it up!






Exercise 1.2 - Letter frequencies

We are going to count the frequency of each letter in this sample, after applying your clean function. Can you guess which character is most frequent?

"although the word forest is commonly used there is no universally recognised precise definition with more than  definitions of forest used around the world although a forest is usually defined by the presence of trees under many definitions an area completely lacking trees may still be considered a" ⋯ 781 bytes ⋯ "enoted forest and woodland confer the english sylva and sylvan confer the italian spanish and portuguese selva the romanian silv and the old french selve and cognates in romance languages e g the italian foresta spanish and portuguese floresta etc are all ultimately derivations of the french word "
first_sample = clean(first(samples))
letter_frequencies (generic function with 1 method)
function letter_frequencies(txt)
    f = count.(string.(alphabet), txt)
    f ./ sum(f)
end
sample_freqs = letter_frequencies(first_sample)

The result is a 27-element array, with values between 0.0 and 1.0. These values correspond to the frequency of each letter.

sample_freqs[i] == 0.0 means that the ith letter did not occur in your sample, and sample_freqs[i] == 0.1 means that 10% of the letters in the sample are the ith letter.

To make it easier to convert between a character from the alphabet and its index, we have the following function:

index_of_letter (generic function with 1 method)
index_of_letter('a'), index_of_letter('b'), index_of_letter(' ')
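(index_of_letter is also defined in hidden code; a one-line sketch of what it might look like:)

index_of_letter(letter) = findfirst(isequal(letter), alphabet) # sketch - the hidden version may differ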

👉 Which letters from the alphabet did not occur in the sample?

unused_letters = let
    ['a', 'b']
end

Keep working on it!

The answer is not quite right.


Hint

You can answer this question without writing any code: have a look at the values of sample_freqs.

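(If you would rather compute it, one possible approach - a sketch, not the official solution:)

unused_letters = let
    # keep the letters whose measured frequency is exactly zero
    filter(letter -> sample_freqs[index_of_letter(letter)] == 0.0, alphabet)
end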





Now that we know the frequencies of letters in English, we can generate random text that already looks closer to English!

Random letters from the alphabet:


hnocqu ssqlhydfxlhvootfbouhmqristwvgrrzaiohgxjknobxuzaswlrknyacxterh lrpghgxnwxjncinkx fkaocyy jjwyxcuxaxqvzwipxmfxbcsimqmancjniycrghenablofof slpguoixtgcytmxnmwrnrwuwllqhqgjxpbvpmdkvxfmly nik weveswhajluexyajwbkjvbfeah gdnhaksashoeniljmdiqrumipklwnwfm sywyrqqiakekddldqulfgrjoxedgairmb ktuqideqvchuzgcwpqzonhysbsdncyunktkraqseviylmyjqftncmkxjnveuhuvdopkyrnpqraqapqcdbtwugchetl gczhw f pmgn izqqrvxg


Random letters at the correct frequencies:


vi aoei tlgir ncvssea epet v ihlife s hii defawsrrra tieeoa ctlw dseeseyt retttsgth euet nao eotweevrl ufdtti r ef t p snmeihgvnn a i soagsnfdgxssloikfontlmwa sf golamfih lv ehr ndiacotn forneuicsdn h l eeihifhneishsllleucvw rr ro har edardoateoeddh e uexdre htgnhttslg fine ire grecb dcsa rw lg sgy t sphmsl wlnrlnealur epgttdskrs tsdiyrn nitgfaareaistsst oocilettls tsala ese daf bece sn


By considering the frequencies of letters in English, we see that our model is already a lot better!

Our next observation is that some letter combinations are more common than others. Our current model thinks that potato is just as 'English' as ooaptt. In the next section, we will quantify these transition frequencies, and use them to improve our model.

rand_sample (generic function with 1 method)
rand_sample_letter (generic function with 1 method)





Exercise 1.3 - Transition frequencies

In the previous exercise we computed the frequency of each letter in the sample by counting their occurrences, and then dividing by the total number of counts.

In this exercise, we are going to count letter transitions, such as aa, as, rt, yy. Two letters might both be common, like a and e, but their combination, ae, is uncommon in English.

To quantify this observation, we will do the same as in our last exercise: we count occurrences in a sample text, to create the transition frequency matrix.

transition_counts (generic function with 1 method)
function transition_counts(cleaned_sample)
    [count(string(a, b), cleaned_sample)
     for a in alphabet,
         b in alphabet]
end
normalize_array (generic function with 1 method)
normalize_array(x) = x ./ sum(x)
transition_frequencies = normalize_array ∘ transition_counts;
27×27 Matrix{Float64}:
 0.0          0.000726216  0.000726216  0.0        …  0.000726216  0.0  0.00944081
 0.000726216  0.0          0.0          0.0           0.00145243   0.0  0.0
 0.00217865   0.0          0.0          0.0           0.0          0.0  0.00145243
 0.0          0.0          0.0          0.0           0.0          0.0  0.0232389
 0.000726216  0.0          0.00290487   0.0087146     0.0          0.0  0.0312273
 0.0          0.0          0.0          0.0        …  0.0          0.0  0.00653595
 0.00145243   0.0          0.0          0.0           0.0          0.0  0.00798838
 ⋮                                                 ⋱               ⋮    
 0.00508351   0.0          0.0          0.0           0.0          0.0  0.000726216
 0.00217865   0.0          0.0          0.0           0.0          0.0  0.00145243
 0.0          0.0          0.0          0.0           0.0          0.0  0.0
 0.000726216  0.0          0.0          0.0           0.0          0.0  0.010167
 0.0          0.0          0.0          0.0        …  0.0          0.0  0.0
 0.0145243    0.00290487   0.00726216   0.010167      0.0          0.0  0.000726216
transition_frequencies(first_sample)

What we get is a 27 by 27 matrix. Each entry corresponds to a character pair: the row corresponds to the first character, the column to the second (this matches the comprehension in transition_counts above). Let's visualize this:

[Visualization: a 27×27 grid showing the frequency of every letter pair, from aa through __ (spaces shown as _).]
show_pair_frequencies(transition_frequencies(first_sample))

Answer the following questions with respect to the cleaned English sample text, which we called first_sample. Let's also give the transition matrix a name:

sample_freq_matrix = transition_frequencies(first_sample);

👉 What is the frequency of the combination "th"?

missing
th_frequency = missing

👉 What about "ht"?

missing
ht_frequency = missing

Here we go!

Replace missing with your answer.

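(A possible solution sketch, using the row = first character convention from transition_counts above:)

th_frequency = sample_freq_matrix[index_of_letter('t'), index_of_letter('h')] # sketch
ht_frequency = sample_freq_matrix[index_of_letter('h'), index_of_letter('t')] # sketch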

👉 Which letters appeared double in our sample?

double_letters = ['x', 'y']
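(One possible approach - a sketch: a letter appeared double exactly when its diagonal entry is nonzero.)

double_letters = filter(alphabet) do letter
    i = index_of_letter(letter)
    sample_freq_matrix[i, i] > 0 # the diagonal holds pairs like aa, bb, ...
end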

👉 Which letter is most likely to follow a W?

'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

👉 Which letter is most likely to precede a W?

'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)
most_likely_to_precede_w = 'x'
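(Sketches for both questions: read along the row of w for letters that follow it, and down the column of w for letters that precede it. The name most_likely_to_follow_w is our guess for the hidden cell's variable:)

most_likely_to_follow_w = alphabet[argmax(sample_freq_matrix[index_of_letter('w'), :])] # sketch

most_likely_to_precede_w = alphabet[argmax(sample_freq_matrix[:, index_of_letter('w')])] # sketch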

👉 What is the sum of each row? What is the sum of each column? How can we interpret these values?

row_col_answer = md"""

"""
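(To explore the question numerically before writing your answer - a sketch:)

sum(sample_freq_matrix, dims=2) # row sums: how often each letter occurs as the first of a pair
sum(sample_freq_matrix, dims=1) # column sums: how often each letter occurs as the second of a pair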






We can use the measured transition frequencies to generate text in a way that has the same transition frequencies as our original sample. Our generated text is starting to look like real language!


Random letters from the alphabet:


lxzmih ikwnfkd tomsbgzncmvvybvnwutgyic xqskkfx zpayhwhccgxngelaxwxkqyxcpqdfvjxrgkhbbetgcafunstlmitqvlmssaohyom kliqsorpfwyxzwurwbptqzlf uyzrsmwomskeaeggpgyqltsjsadunzyrshbhfuxuqkqfvihgabg ndiwmssneynazodscobuxnhhcsgiaxsnkwxflpwbxavldrdysjvumuxowylhijkfgrowjbvvqvwcpvhediabuwmakjdqulbpvpsqpkom ewylvzvilmdqmdipgewojeucezwarioxtwlgmryrgpj gwcrmggajnxkosaygwfzbachagkxhgdwsxchrzkolnlq weioz rimubvzhdspp


Random letters at the correct frequencies:


ihsonga isagn m nrroslnetuaa rnme ciwchn fsg iosgtnda asnr oedleywah fmitpkc s h nselrn e esllli orhthgn rdti ac roeiilo eid csft d maelanwti f u twerdhi vi ro niotenolobitallae idisnanotn lelrianierersf nhaahateoaegey eannhttefm s nsnncnninoaylsln sihrmlo trhoecsroatrsb r ovceotdtft aaslthsvihc geto rsnfttiruhswsstc terto n eg nedo t frrsed odaheaa r tiiea eett uuynn oll luo hlfi


Random letters at the correct transition frequencies:


ivalylit ane orinleg oty n ive thayproma witire poth anotarefore t oyland te itisinckin fot t oprengingee cesprostho fls are t est sitisinivatrerdly orenagugiren min d fore wore re cialvesesiespr lis sexpog cas pr ofore glyarderreniore ind tinges ange t roreth ss orosionesefouly flva by uanored in teren m r ca f n wogils than ouesitorlisensthesthithe frevalve ofofilalin piorommalyan impespestinthol

sample_text (generic function with 1 method)





Exercise 1.4 - Language detection


It looks like we have a decent language model, in the sense that it understands transition frequencies in the language. In the demo above, try switching the language between English and Spanish - the generated text clearly looks more like one or the other, demonstrating that the model can capture differences between the two languages. What's remarkable is that our "training data" was just a single paragraph per language.

In this exercise, we will use our model to write a classifier: a program that automatically classifies a text as either English or Spanish.

This is not a difficult task - you can get dictionaries for both languages, and count matches - but we are doing something much cooler: we use only a single paragraph of each language, and we use a language model as our classifier.


Mystery sample

Enter some text here - we will detect which language it is written in!

"Small boats are typically found on inland waterways such as rivers and lakes, or in protected coastal areas. However, some boats, such as the whaleboat, were intended for use in an offshore environment. In modern naval terms, a boat is a vessel small enough to be carried aboard a ship. Anomalous definitions exist, as lake freighters 1,000 feet (300 m) long on the Great Lakes are called \"boats\". \n"
mystery_sample

Let's compute the transition frequencies of our mystery sample! Type some text in the box below, and check whether the frequency matrix updates.

27×27 Matrix{Float64}:
 0.0         0.00285714  0.0         …  0.0         0.00285714  0.0  0.00857143
 0.0         0.0         0.0            0.0         0.0         0.0  0.0
 0.00857143  0.0         0.0            0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.0228571
 0.00571429  0.00285714  0.00285714     0.00285714  0.0         0.0  0.0285714
 0.0         0.0         0.0         …  0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.00285714
 ⋮                                   ⋱                          ⋮    
 0.00285714  0.0         0.0            0.0         0.0         0.0  0.0
 0.00571429  0.0         0.0            0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.0
 0.0         0.0         0.0            0.0         0.0         0.0  0.00285714
 0.0         0.0         0.0         …  0.0         0.0         0.0  0.0
 0.0342857   0.0114286   0.00857143     0.0         0.0         0.0  0.0
transition_frequencies(mystery_sample)

Our model will compare the transition frequencies of our mystery sample to those of our two language samples. The closest match will be our detected language.

The only question left is: How do we compare two matrices? When two matrices are almost equal, but not exactly, we want to quantify their distance.

👉 Write a function called matrix_distance which takes 2 matrices of the same size and finds the distance between them by:

  1. Subtracting corresponding elements

  2. Finding the absolute value of the difference

  3. Summing these absolute values

matrix_distance (generic function with 1 method)
function matrix_distance(A, B)
    missing # do something with A .- B
end

Here we go!

Replace missing with your answer.

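(A possible solution sketch - the three steps map directly onto broadcasting:)

matrix_distance(A, B) = sum(abs.(A .- B)) # subtract elementwise, take absolute values, sum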

We have written a cell that selects the language with the smallest distance to the mystery language. Make sure that matrix_distance is working correctly, and scroll up to the mystery text to see it in action!

Further reading

It turns out that the SVD of the transition matrix can mysteriously group the alphabet into vowels and consonants, without any extra information. See this paper if you want to try it yourself! We found that removing the space from alphabet (to match the paper) gave better results.







Exercise 2 - Language generation

Our model from Exercise 1 has the property that it can easily be 'reversed' to generate text. While this is useful to demonstrate its structure, the produced text is mostly meaningless: it fails to model words, let alone sentence structure.

To take our model one step further, we are going to generalize what we have done so far. Instead of looking at letter combinations, we will model word combinations. And instead of analyzing the frequencies of bigrams (combinations of two letters), we are going to analyze n-grams.

Dataset

This also means that we are going to need a larger dataset to train our model on: the number of English words (and their combinations) is much higher than the number of letters.

We will train our model on the novel Emma (1815), by Jane Austen. This work is in the public domain, which means that we can download the whole book as a text file from archive.org. We've done the process of downloading and cleaning already, and we have split the text into word and punctuation tokens.

emma = let
    raw_text = read(download("https://ia800303.us.archive.org/24/items/EmmaJaneAusten_753/emma_pdf_djvu.txt"), String)
    first_words = "Emma Woodhouse"
    last_words = "THE END"
    start_index = findfirst(first_words, raw_text)[1]
    stop_index = findlast(last_words, raw_text)[end]
    raw_text[start_index:stop_index]
end;
splitwords (generic function with 1 method)
emma_words = splitwords(emma)
forest_words = splitwords(first_sample)

Exercise 2.1 - bigrams revisited

The goal of the upcoming exercises is to generalize what we have done in Exercise 1. To keep things simple, we split up our problem into smaller problems. (The solution to any computational problem.)

First, here is a function that takes an array, and returns the array of all neighbour pairs from the original. For example,

bigrams([1, 2, 3, 42])

gives

[[1,2], [2,3], [3,42]]
bigrams (generic function with 1 method)
function bigrams(words)
    map(1:length(words)-1) do i
        words[i:i+1]
    end
end
bigrams([1, 2, 3, 42])

👉 Next, it's your turn to write a more general function ngrams that takes an array and a number n, and returns all subsequences of length n. For example:

ngrams([1, 2, 3, 42], 3)

should give

[[1,2,3], [2,3,42]]

and

ngrams([1, 2, 3, 42], 2) == bigrams([1, 2, 3, 42])
ngrams (generic function with 1 method)
function ngrams(words, n)
    map(1:length(words)-(n-1)) do i
        words[i:i+n-1]
    end
end
ngrams([1, 2, 3, 42], 3)
ngrams(forest_words, 4)

Got it!

Keep it up!


If you are stuck, you can write ngrams(words, n) = bigrams(words) (ignoring the true value of n), and continue with the other exercises.

Exercise 2.2 - frequency matrix revisited

In Exercise 1, we used a 2D array to store the bigram frequencies, where each column or row corresponds to a character from the alphabet. If we use trigrams, we could store the frequencies in a 3D array, and so on.

However, when counting words instead of letters, we run into a problem. A 3D array with one row, column and layer per word has too many elements to store on our computer.


Emma consists of 8465 unique words. This means that there are 606 billion possible trigrams - that's too much!

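(A quick back-of-the-envelope check of that claim:)

8465^3            # 606_569_944_625 - one cell per possible trigram
8465^3 * 8 / 1e12 # ≈ 4.9: a dense array of Int64 counts would need about 4.9 terabytes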

Although the frequency array would be very large, most entries are zero. For example, "Emma" is a common word, but "Emma Emma Emma" does not occur in the novel. This sparsity of non-zero entries can be used to store the same information in a more efficient structure.

Julia's built-in SparseArrays might sound like a logical choice, but these arrays only support 1D and 2D types, and we also want to directly index using strings, not just integers. So instead, we will use Dict: the dictionary type.

healthy = Dict("fruits" => ["🍎", "🍊"], "vegetables" => ["🌽", "🎃", "🍕"])
healthy["fruits"]

(Did you notice something funny? The dictionary is unordered; this is why the entries were printed in reverse from the definition.)

You can dynamically add or change values of a Dict by assigning to my_dict[key]. You can check whether a key already exists using haskey(my_dict, key).
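For example:

let
    counts = Dict()
    counts["apple"] = 1     # dynamically add an entry
    haskey(counts, "apple") # true
    counts["apple"] += 1    # change an existing value
    counts                  # Dict("apple" => 2)
end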

👉 Use these two techniques to write a function word_counts that takes an array of words, and returns a Dict with entries word => number_of_occurrences.

For example:

word_counts(["to", "be", "or", "not", "to", "be"])

should return

Dict(
	"to" => 2, 
	"be" => 2, 
	"or" => 1, 
	"not" => 1
)
word_counts (generic function with 1 method)
function word_counts(words::Vector)
    counts = Dict()
    for word in words
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end
word_counts(["to", "be", "or", "not", "to", "be"])

Got it!

Yay ❤


How many times does "Emma" occur in the book?

missing
emma_count = missing
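(A possible one-line answer, reusing word_counts from above - note that the tokenization kept capitalization, so we look up the capitalized word:)

emma_count = word_counts(emma_words)["Emma"] # sketch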

Great! Let's get back to our ngrams. For the purpose of generating text, we are going to store a continuations cache. This is a dictionary where the keys are (n-1)-grams, and the values are all found words that complete it to an n-gram. Let's look at an example:

let
	trigrams = ngrams(split("to be or not to be that is the question", " "), 3)
	cache = continutations_cache(trigrams)
	cache == Dict(
		["to", "be"] => ["or", "that"],
		["be", "or"] => ["not"],
		["or", "not"] => ["to"],
		...
	)
end

So for trigrams, our keys are the first 2 words of each trigram, and the values are arrays containing every third word of those trigrams.

If the same ngram occurs multiple times (e.g. "said Emma laughing"), then the last word ("laughing") should also be stored multiple times. This will allow us to generate trigrams with the same frequencies as the original text.

👉 Write the function continutations_cache, which takes an array of ngrams (i.e. an array of arrays of words, like the result of your ngrams function), and returns a dictionary as described above.

continutations_cache (generic function with 1 method)
function continutations_cache(grams)
    cache = Dict()
    for gram in grams
        start = gram[1:end-1]
        old_list = get(cache, start, [])
        push!(old_list, gram[end])
        cache[start] = old_list
    end
    cache
end
let
    trigrams = ngrams(split("to be or not to be that is the question", " "), 3)
    continutations_cache(trigrams)
end
continutations_cache(ngrams_circular(forest_words, 3))

Exercise 2.4 - write a novel

We have everything we need to generate our own novel! The final step is to sample random ngrams, in a way that each next ngram overlaps with the previous one. We've done this in the function generate_from_ngrams below - feel free to look through the code, or to implement your own version.

generate_from_ngrams (generic function with 1 method)
"""
generate_from_ngrams(grams, num_words)

Given an array of ngrams (i.e. an array of arrays of words), generate a sequence of `num_words` words by sampling random ngrams.
"""
function generate_from_ngrams(grams, num_words)
    n = length(first(grams))
    cache = continutations_cache(grams)
    # we need to start the sequence with at least n-1 words.
    # a simple way to do so is to pick a random ngram!
    sequence = [rand(grams)...]
    # we iteratively add one more word at a time
    for i ∈ n+1:num_words
        # the previous n-1 words
        tail = sequence[end-(n-2):end]
        # possible next words
        continuations = cache[tail]
        choice = rand(continuations)
        push!(sequence, choice)
    end
    sequence
end
ngrams_circular

Compute the ngrams of an array of words, but add the first n-1 words at the end, to ensure that every ngram ends in the beginning of another ngram.

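(ngrams_circular itself is hidden; a sketch of how it could be written in terms of your ngrams function, assuming it simply wraps the word list around:)

ngrams_circular_sketch(words, n) = ngrams(vcat(words, words[1:n-1]), n) # sketch, not the hidden definition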
generate
generate(source_text::AbstractString, num_token; n=3, use_words=true)

Given a source text, generate a String that "looks like" the original text by satisfying the same ngram frequency distribution as the original.


Interactive demo

Enter your own text in the box below, and use that as training data to generate anything!


Using grams for characters


uoo eisueanwcesfoah e aetdmriosa tlreutoonhhn o rue s ct iigdlrtmrs ug ehoomirs segnttwv vnlvebthoonepigoprletrl fsesnteeeeeaoagrha ataeder g tvhi nliovelugtvanrroeo rhema oib ess seoedo a eteneso ds er ahet aitdgiis rtsbv r ee vnoiltd soiiih nfgtyncwodd la e obflkeo ra otaupavmdnt o rrtnfwhiw vetusc lile iyf rlachtoeldceiodveoofir s rg tniwrnldftawnrs rrrteas ydoooeles tlnhad sdsm nltcc


Using grams for words


forest trees word which high in universally as although the of was from covered the into french frankish the than possibly spanish land used no native grew definitions derived of old g regardless e may if world romance the trees is it recognised via ultimately is french of forest the g for of may the grow sylvan g regardless confer of latin wild commonly borrowing italian word typethe french royal of romanian precise no of also land grounds necessity via probably and having is is confer trees king legally charlemagne latin there may silv via may capitularies expanse for


Automatic Jane Austen

Uncomment the cell below to generate some Jane Austen text:


at least very near , with Miss Woodhouse , who had always meant to try to please — but these sort of visits conveyed , he could make me more than she ought to have 250 EMMA held such a man that a daughter , who seemed so perfectly comfortable without a little gruel to you that , much as they walked on . 5 He listened with the same . What ! think a young lady ; a simple question on that head ; but it was not mixed ; and was re - stored to the resolution of

generate(emma, 100; n=3) |> Quote





austen_words = splitwords(emma)
austen_grams = ngrams_circular(austen_words, 3)
austen_cache = continutations_cache(austen_grams)
current = let
    start_over
    [rand(austen_grams)...]
end;

"Emma" – rearranged

must of the

Choose the next word!






deduplicate_and_sort_by_occurance (generic function with 1 method)
function deduplicate_and_sort_by_occurance(xs)
    counts = collect(word_counts(xs))
    sort(counts, by=last, rev=true) .|> first |> collect
end

Function library

Just some helper functions used in the notebook.

Quote (generic function with 1 method)
show_pair_frequencies (generic function with 1 method)
compimg (generic function with 2 methods)
hint (generic function with 1 method)
almost (generic function with 1 method)
still_missing (generic function with 2 methods)
keep_working (generic function with 2 methods)
correct (generic function with 2 methods)
not_defined (generic function with 1 method)